{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "4c7436b8",
   "metadata": {},
   "source": [
    "# Estimate Consensus and Annotator Quality for Data Labeled by Multiple Annotators"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4b432513",
   "metadata": {},
   "source": [
    "This 5-minute quickstart tutorial shows how to use cleanlab for classification data that has been labeled by *multiple* annotators (where each example has been labeled by at least one annotator, but not every annotator has labeled every example). Compared to existing crowdsourcing tools, cleanlab helps you better analyze such data by leveraging a trained classifier model in addition to the raw annotations. With one line of code, you can automatically compute:\n",
    "\n",
    "- A **consensus label** for each example (i.e. *truth inference*) that aggregates the individual annotations (more accurately than algorithms from crowdsourcing like majority-vote, Dawid-Skene, or GLAD).\n",
    "- A **quality score for each consensus label** which measures our confidence that this label is correct (via well-calibrated estimates that account for the: number of annotators which have labeled this example, overall quality of each annotator, and quality of our trained ML models).\n",
    "- An analogous **label quality score** for each individual label chosen by one annotator for a particular example (to measure our confidence in alternate labels when annotators differ from the consensus).\n",
    "- An **overall quality score for each annotator** which measures our confidence in the overall correctness of labels obtained from this annotator.\n",
    "\n",
    "**Overview of what we'll do in this tutorial:**\n",
    "\n",
    "- Obtain initial consensus labels of multiannotator data using majority vote.\n",
    "- Train a classifier model on the initial consensus labels and use it to obtain out-of-sample predicted class probabilities.\n",
    "- Use cleanlab's `multiannotator.get_label_quality_multiannotator` function to get improved consensus labels that more accurately reflect the ground truth.\n",
    "- View other information about your multiannotator dataset, such as consensus and annotator quality scores, agreement between annotators, detailed label quality scores and more!\n",
    "\n",
    "**Consensus labels** represent the best guess of the true label for each example and can be used for more reliable modeling/analytics. Cleanlab automatically produces enhanced estimates of consensus through the use of machine learning.\n",
    "**Quality scores** help us determine how much trust we can place in each: consensus label, individual annotator, and particular label from a particular annotator. These quality scores can help you determine which annotators are best/worst overall, as well as which current consensus labels are least trustworthy and should perhaps be verified via additional annotation. \n",
    "\n",
    "This tutorial uses a toy *tabular* dataset labeled with multiple annotators but **these steps can easily be applied to image or text data**."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "03385f84",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "Quickstart\n",
    "<br/>\n",
    "    \n",
    "Already have `multiannotator_labels` and (out-of-sample) `pred_probs` from a model trained on an existing set of consensus labels? Run the code below to get improved consensus labels and more information about the quality of your labels and annotators.\n",
    "\n",
    "<div  class=markdown markdown=\"1\" style=\"background:white;margin:16px\">  \n",
    "    \n",
    "```ipython3 \n",
    "from cleanlab.multiannotator import get_label_quality_multiannotator\n",
    "\n",
    "get_label_quality_multiannotator(multiannotator_labels, pred_probs)\n",
    "\n",
    "```\n",
    "\n",
    "    \n",
    "</div>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e6a48d31",
   "metadata": {},
   "source": [
    "## 1. Install and import required dependencies"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6c6e5b15",
   "metadata": {},
   "source": [
    "You can use `pip` to install all packages required for this tutorial as follows:\n",
    "\n",
    "```ipython3\n",
    "!pip install cleanlab\n",
    "\n",
    "# Make sure to install the version corresponding to this tutorial\n",
    "# E.g. if viewing master branch documentation:\n",
    "#     !pip install git+https://github.com/cleanlab/cleanlab.git\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a3ddc95f",
   "metadata": {
    "nbsphinx": "hidden"
   },
   "outputs": [],
   "source": [
    "# Package installation (hidden on docs website).\n",
    "dependencies = [\"cleanlab\"]\n",
    "\n",
    "if \"google.colab\" in str(get_ipython()):  # Check if it's running in Google Colab\n",
    "    %pip install cleanlab  # for colab\n",
    "    cmd = ' '.join([dep for dep in dependencies if dep != \"cleanlab\"])\n",
    "    %pip install $cmd\n",
    "else:\n",
    "    dependencies_test = [dependency.split('>')[0] if '>' in dependency \n",
    "                         else dependency.split('<')[0] if '<' in dependency \n",
    "                         else dependency.split('=')[0] for dependency in dependencies]\n",
    "    missing_dependencies = []\n",
    "    for dependency in dependencies_test:\n",
    "        try:\n",
    "            __import__(dependency)\n",
    "        except ImportError:\n",
    "            missing_dependencies.append(dependency)\n",
    "\n",
    "    if len(missing_dependencies) > 0:\n",
    "        print(\"Missing required dependencies:\")\n",
    "        print(*missing_dependencies, sep=\", \")\n",
    "        print(\"\\nPlease install them before running the rest of this notebook.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dd0148e6",
   "metadata": {},
   "source": [
    "Let’s import some of the packages needed throughout this tutorial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c4efd119",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection import cross_val_predict\n",
    "\n",
    "from cleanlab.multiannotator import get_label_quality_multiannotator, get_majority_vote_label"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "345b6678",
   "metadata": {},
   "source": [
    "## 2. Create the data (can skip these details)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "82aeedc8",
   "metadata": {},
   "source": [
    "For this tutorial we will generate a toy dataset that has 50 annotators and 300 examples. There are three possible classes, `0`, `1` and `2`. \n",
    "\n",
    "Each annotator annotates approximately 10% of the examples. We also synthetically made the last 5 annotators in our toy dataset have much noisier labels than the rest of the annotators.\n",
    "\n",
    "Solely for evaluating cleanlab's consensus labels against other consensus methods, we here also generate the true labels for this example dataset. However, true labels are not required for any cleanlab multiannotator functions (and they usually are not available in real applications).\n",
    "To generate our multiannotator data, we define a `make_data()` method (can skip these details)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "69b5ddaa",
   "metadata": {},
   "source": [
    "<details><summary>See the code for data generation **(click to expand)**</summary>\n",
    "    \n",
    "```ipython3\n",
    "# Note: This pulldown content is for docs.cleanlab.ai, if running on local Jupyter or Colab, please ignore it.\n",
    "    \n",
    "from cleanlab.benchmarking.noise_generation import generate_noise_matrix_from_trace\n",
    "from cleanlab.benchmarking.noise_generation import generate_noisy_labels\n",
    "\n",
    "SEED = 111 # set to None for non-reproducible randomness\n",
    "np.random.seed(seed=SEED)\n",
    "\n",
    "def make_data(\n",
    "    means=[[3, 2], [7, 7], [0, 8]],\n",
    "    covs=[[[5, -1.5], [-1.5, 1]], [[1, 0.5], [0.5, 4]], [[5, 1], [1, 5]]],\n",
    "    sizes=[150, 75, 75],\n",
    "    num_annotators=50,\n",
    "):\n",
    "    \n",
    "    m = len(means)  # number of classes\n",
    "    n = sum(sizes)\n",
    "    local_data = []\n",
    "    labels = []\n",
    "\n",
    "    for idx in range(m):\n",
    "        local_data.append(\n",
    "            np.random.multivariate_normal(mean=means[idx], cov=covs[idx], size=sizes[idx])\n",
    "        )\n",
    "        labels.append(np.array([idx for i in range(sizes[idx])]))\n",
    "    X_train = np.vstack(local_data)\n",
    "    true_labels_train = np.hstack(labels)\n",
    "\n",
    "    # Compute p(true_label=k)\n",
    "    py = np.bincount(true_labels_train) / float(len(true_labels_train))\n",
    "    \n",
    "    noise_matrix_better = generate_noise_matrix_from_trace(\n",
    "        m,\n",
    "        trace=0.8 * m,\n",
    "        py=py,\n",
    "        valid_noise_matrix=True,\n",
    "        seed=SEED,\n",
    "    )\n",
    "    \n",
    "    noise_matrix_worse = generate_noise_matrix_from_trace(\n",
    "        m,\n",
    "        trace=0.35 * m,\n",
    "        py=py,\n",
    "        valid_noise_matrix=True,\n",
    "        seed=SEED,\n",
    "    )\n",
    "\n",
    "    # Generate our noisy labels using the noise_matrix for specified number of annotators.\n",
    "    s = pd.DataFrame(\n",
    "        np.vstack(\n",
    "            [\n",
    "                generate_noisy_labels(true_labels_train, noise_matrix_better)\n",
    "                if i < num_annotators - 5\n",
    "                else generate_noisy_labels(true_labels_train, noise_matrix_worse)\n",
    "                for i in range(num_annotators)\n",
    "            ]\n",
    "        ).transpose()\n",
    "    )\n",
    "\n",
    "    # Each annotator only labels approximately 10% of the dataset\n",
    "    # (unlabeled points represented with NaN)\n",
    "    s = s.apply(lambda x: x.mask(np.random.random(n) < 0.9)).astype(\"Int64\")\n",
    "    s.dropna(axis=1, how=\"all\", inplace=True)\n",
    "    s.columns = [\"A\" + str(i).zfill(4) for i in range(1, num_annotators+1)]\n",
    "\n",
    "    row_NA_check = pd.notna(s).any(axis=1)\n",
    "\n",
    "    return {\n",
    "        \"X_train\": X_train[row_NA_check],\n",
    "        \"true_labels_train\": true_labels_train[row_NA_check],\n",
    "        \"multiannotator_labels\": s[row_NA_check].reset_index(drop=True),\n",
    "    }\n",
    "```\n",
    "    \n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c37c0a69",
   "metadata": {
    "nbsphinx": "hidden"
   },
   "outputs": [],
   "source": [
    "from cleanlab.benchmarking.noise_generation import generate_noise_matrix_from_trace\n",
    "from cleanlab.benchmarking.noise_generation import generate_noisy_labels\n",
    "\n",
    "SEED = 111 # set to None for non-reproducible randomness\n",
    "np.random.seed(seed=SEED)\n",
    "\n",
    "def make_data(\n",
    "    means=[[3, 2], [7, 7], [0, 8]],\n",
    "    covs=[[[5, -1.5], [-1.5, 1]], [[1, 0.5], [0.5, 4]], [[5, 1], [1, 5]]],\n",
    "    sizes=[150, 75, 75],\n",
    "    num_annotators=50,\n",
    "):\n",
    "    \n",
    "    m = len(means)  # number of classes\n",
    "    n = sum(sizes)\n",
    "    local_data = []\n",
    "    labels = []\n",
    "\n",
    "    for idx in range(m):\n",
    "        local_data.append(\n",
    "            np.random.multivariate_normal(mean=means[idx], cov=covs[idx], size=sizes[idx])\n",
    "        )\n",
    "        labels.append(np.array([idx for i in range(sizes[idx])]))\n",
    "    X_train = np.vstack(local_data)\n",
    "    true_labels_train = np.hstack(labels)\n",
    "\n",
    "    # Compute p(true_label=k)\n",
    "    py = np.bincount(true_labels_train) / float(len(true_labels_train))\n",
    "    \n",
    "    noise_matrix_better = generate_noise_matrix_from_trace(\n",
    "        m,\n",
    "        trace=0.8 * m,\n",
    "        py=py,\n",
    "        valid_noise_matrix=True,\n",
    "        seed=SEED,\n",
    "    )\n",
    "    \n",
    "    noise_matrix_worse = generate_noise_matrix_from_trace(\n",
    "        m,\n",
    "        trace=0.35 * m,\n",
    "        py=py,\n",
    "        valid_noise_matrix=True,\n",
    "        seed=SEED,\n",
    "    )\n",
    "\n",
    "    # Generate our noisy labels using the noise_matrix for specified number of annotators.\n",
    "    s = pd.DataFrame(\n",
    "        np.vstack(\n",
    "            [\n",
    "                generate_noisy_labels(true_labels_train, noise_matrix_better)\n",
    "                if i < num_annotators - 5\n",
    "                else generate_noisy_labels(true_labels_train, noise_matrix_worse)\n",
    "                for i in range(num_annotators)\n",
    "            ]\n",
    "        ).transpose()\n",
    "    )\n",
    "\n",
    "    # Each annotator only labels approximately 10% of the dataset\n",
    "    # (unlabeled points represented with NaN)\n",
    "    s = s.apply(lambda x: x.mask(np.random.random(n) < 0.9)).astype(\"Int64\")\n",
    "    s.dropna(axis=1, how=\"all\", inplace=True)\n",
    "    s.columns = [\"A\" + str(i).zfill(4) for i in range(1, num_annotators+1)]\n",
    "\n",
    "    row_NA_check = pd.notna(s).any(axis=1)\n",
    "\n",
    "    return {\n",
    "        \"X_train\": X_train[row_NA_check],\n",
    "        \"true_labels_train\": true_labels_train[row_NA_check],\n",
    "        \"multiannotator_labels\": s[row_NA_check].reset_index(drop=True),\n",
    "    }"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "99f69523",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_dict = make_data()\n",
    "\n",
    "X = data_dict[\"X_train\"]\n",
    "multiannotator_labels = data_dict[\"multiannotator_labels\"]\n",
    "true_labels = data_dict[\"true_labels_train\"] # used for comparing the accuracy of consensus labels"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4a705e28",
   "metadata": {},
   "source": [
    "Let's view the first few rows of the data used for this tutorial. Here are the labels selected by each annotator for the first few examples (rows) in the dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8f241c16",
   "metadata": {},
   "outputs": [],
   "source": [
    "multiannotator_labels.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4a705e29",
   "metadata": {},
   "source": [
    "Here are the corresponding features for these examples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4f0819ba",
   "metadata": {},
   "outputs": [],
   "source": [
    "X[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0cb8131d",
   "metadata": {},
   "source": [
    "`multiannotator_labels` contains the class label that each annotator chose for each example in the dataset, with examples that a particular annotator did not label represented using `np.nan`. \n",
    "`X` contains the features for each example, which happen to be numeric in this tutorial but any feature modality can be used with ``cleanlab.multiannotator``."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "946726ad",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "Bringing Your Own Data (BYOD)?\n",
    "\n",
    "You can easily replace the above with your own multiannotator labels and features, then continue with the rest of the tutorial.\n",
    " \n",
    "`multiannotator_labels` should be a numpy array or pandas DataFrame with each column representing an annotator and each row representing an example. Your labels should be represented as integer indices 0, 1, ..., num_classes - 1, where examples that are not annotated by a particular annotator are represented using `np.nan` or `pd.NA`. If you have string labels or other labels that do not fit the required format, you can convert them to the proper format using `cleanlab.internal.multiannotator_utils.format_multiannotator_labels`. \n",
    "    \n",
    "Your features can be represented however you like (since these are not inputs to `cleanlab.multiannotator` methods) as long as you are able to fit a classifer to them and obtain its predicted class probabilities! \n",
    "\n",
    "</div>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "51335def",
   "metadata": {},
   "source": [
    "## 3. Get initial consensus labels via majority vote and compute out-of-sample predicted probabilities"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c1857cc7",
   "metadata": {},
   "source": [
    "Before training a machine learning model, we must first obtain initial consensus labels from the data annotations representing a crude guess of the best label for each example. The most straight forward way to obtain an initial set of consensus labels is via simple majority vote."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d009f347",
   "metadata": {},
   "outputs": [],
   "source": [
    "majority_vote_label = get_majority_vote_label(multiannotator_labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7287b733",
   "metadata": {},
   "source": [
    "Majority vote consensus labels may not be very reliable, particularly for examples that were only labeled by one or a few annotators. To more reliably estimate consensus, we can account for the features associated with each example (based on which the annotations were derived in the first place). Fitting a classifier model serves as a natural way to account for these feature values, here  we train a simple logistic regression model to get significantly more accurate estimates of consensus labels and associated quality scores.\n",
    "\n",
    "We fit the model with our initial consensus labels, and then get (out-of-sample) predicted class probabilities for each example in the dataset from the trained model. These predicted probabilities help us estimate the best consensus labels and associated confidence values in a statistically optimal manner that accounts for all the available information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cbd1e415",
   "metadata": {},
   "outputs": [],
   "source": [
    "model = LogisticRegression()\n",
    "\n",
    "num_crossval_folds = 5  \n",
    "pred_probs = cross_val_predict(\n",
    "    estimator=model, X=X, y=majority_vote_label, cv=num_crossval_folds, method=\"predict_proba\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4eab5188",
   "metadata": {},
   "source": [
    "## 4. Use cleanlab to get better consensus labels and other statistics"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4d392ce5",
   "metadata": {},
   "source": [
    "Using the annotators' labels and the (out-of-sample) predicted class probabilities from the model, cleanlab can estimate **improved consensus labels** for our data that are more accurate than our initial consensus labels were.\n",
    "\n",
    "Having accurate labels provides insight on each annotator's label quality and is key for boosting model accuracy and achieving dependable real-world results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6ca92617",
   "metadata": {},
   "outputs": [],
   "source": [
    "results = get_label_quality_multiannotator(multiannotator_labels, pred_probs, verbose=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "98042e7f",
   "metadata": {},
   "source": [
    "Here, we use the `multiannotator.get_label_quality_multiannotator()` function which returns a dictionary containing three items:\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "76d7c0e2",
   "metadata": {},
   "source": [
    "1. `label_quality` which gives us the improved consensus labels using information from each of the annotators and the model. The DataFrame also contains information about the number of annotations, annotator agreement and consensus quality score for each example.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bf945113",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "results[\"label_quality\"].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "984d65c4",
   "metadata": {},
   "source": [
    "2. `detailed_label_quality` which returns the label quality score for each label given by every annotator"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "14251ee0",
   "metadata": {},
   "outputs": [],
   "source": [
    "results[\"detailed_label_quality\"].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db02e63d",
   "metadata": {},
   "source": [
    "3. `annotator_stats` which gives us the annotator quality score for each annotator, alongisde other information such as the number of examples each annotator labeled, their agreement with the consensus labels and the class they perform the worst at. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "efe16638",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "results[\"annotator_stats\"].head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a0d09bfa",
   "metadata": {},
   "source": [
    "The `annotator_stats` DataFrame is sorted by increasing `annotator_quality`, showing us the worst annotators first.\n",
    "\n",
    "Notice that in the above table annotators with ids A0046 to A0050 have the worst annotator quality score, which is expected because we made the last 5 annotators systematically worse than the rest."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "20ca8dd2",
   "metadata": {},
   "source": [
    "### Comparing improved consensus labels"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1b49657d",
   "metadata": {},
   "source": [
    "We can get the improved consensus labels from the `label_quality` DataFrame shown above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "abd0fb0b",
   "metadata": {},
   "outputs": [],
   "source": [
    "improved_consensus_label = results[\"label_quality\"][\"consensus_label\"].values"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1fd7a5fd",
   "metadata": {},
   "source": [
    "Since our toy dataset is synthetically generated by adding noise to each annotator's labels, we know the ground truth labels for each example. Hence we can compare the accuracy of the consensus labels obtained using majority vote, and the improved consensus labels obtained using cleanlab."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cdf061df",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "majority_vote_accuracy = np.mean(true_labels == majority_vote_label)\n",
    "cleanlab_label_accuracy = np.mean(true_labels == improved_consensus_label)\n",
    "\n",
    "print(f\"Accuracy of majority vote labels = {majority_vote_accuracy}\")\n",
    "print(f\"Accuracy of cleanlab consensus labels = {cleanlab_label_accuracy}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c20b2c9",
   "metadata": {},
   "source": [
    "We can see that the accuracy of the consensus labels improved as a result of using cleanlab, which not only takes the annotators' labels into account, but also a model to compute better consensus labels."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f82dd4d5",
   "metadata": {},
   "source": [
    "### Inspecting consensus quality scores to find potential consensus label errors"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fddb5453",
   "metadata": {},
   "source": [
    "We can get the consensus quality score from the `label_quality` DataFrame shown above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "08949890",
   "metadata": {},
   "outputs": [],
   "source": [
    "consensus_quality_score = results[\"label_quality\"][\"consensus_quality_score\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f150a08",
   "metadata": {},
   "source": [
    "Besides obtaining improved consensus labels, cleanlab also computes consensus quality scores for each example. The lower scores represent potential consensus label errors in the dataset.\n",
    "\n",
    "Here, we will extract 15 examples that have the lowest consensus quality score, and we can compare their average accuracy when compared to the true labels. We will also compute the average accuracy for the rest of the examples for comparison."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6948b073",
   "metadata": {},
   "outputs": [],
   "source": [
    "sorted_consensus_quality_score = consensus_quality_score.sort_values()\n",
    "worst_quality = sorted_consensus_quality_score.index[:15]\n",
    "better_quality = sorted_consensus_quality_score.index[15:]\n",
    "\n",
    "worst_quality_accuracy = np.mean(true_labels[worst_quality] == improved_consensus_label[worst_quality])\n",
    "better_quality_accuracy = np.mean(true_labels[better_quality] == improved_consensus_label[better_quality])\n",
    "\n",
    "print(f\"Accuracy of 15 worst quality examples = {worst_quality_accuracy}\")\n",
    "print(f\"Accuracy of better quality examples = {better_quality_accuracy}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4fdf4d91",
   "metadata": {},
   "source": [
    "We observe that the 15 worst-consensus-quality-score examples have a lower average accuracy compared to the rest of the examples. Cleanlab automatically determines which consensus labels are least trustworthy (perhaps want to have another annotator look at that data). Here we see these trustworthiness estimates really do correspond to the true quality of the consensus labels (which we know in this toy dataset because we have the true labels, unlike in your applications)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "06cae16a",
   "metadata": {},
   "source": [
    "## 5. Retrain model using improved consensus labels"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8d4e31ab",
   "metadata": {},
   "source": [
    "After obtaining the improved consensus labels, we can now retrain a better version of our machine learning model using these newly obtained labels. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6f8e6914",
   "metadata": {},
   "outputs": [],
   "source": [
    "model = LogisticRegression()\n",
    "\n",
    "num_crossval_folds = 5  \n",
    "improved_pred_probs = cross_val_predict(\n",
    "    estimator=model, X=X, y=improved_consensus_label, cv=num_crossval_folds, method=\"predict_proba\"\n",
    ")\n",
    "\n",
    "# alternatively, we can treat all the improved consensus labels as training labels to fit the model \n",
    "# model.fit(X, improved_consensus_label)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e59f7d4f",
   "metadata": {},
   "source": [
    "## Further improvements \n",
    "You can also repeat this process of getting better consensus labels using the model's out-of-sample predicted probabilities and then retraining the model with the improved labels to get even better predicted class probabilities in a virtuous cycle!\n",
    "For details, see our [examples](https://github.com/cleanlab/examples) notebook on [Iterative use of Cleanlab to Improve Classification Models (and Consensus Labels) from Data Labeled by Multiple Annotators](https://github.com/cleanlab/examples/blob/master/multiannotator_cifar10/multiannotator_cifar10.ipynb).\n",
    "\n",
    "If possible, the best way to improve your model is to collect additional labels for both previously annotated data and extra not-yet-labeled examples (i.e. *active learning*). To decide which data is most informative to label next, use `cleanlab.multiannotator.get_active_learning_scores()` rather than the methods from this tutorial. This is demonstrated in our examples notebook on [Active Learning with Multiple Data Annotators via ActiveLab](https://github.com/cleanlab/examples/blob/master/active_learning_multiannotator/active_learning.ipynb).\n",
    "\n",
    "While this notebook focused on analzying the labels of your data, cleanlab can also check your data features for various issues. Learn how to do this by following our [Datalab tutorials](../tutorials/datalab/index.html), except you do not need to pass in `labels` now that you've already analyzed them with this notebook (or you can provide `labels` to Datalab as the consensus labels estimated here).\n",
    "\n",
    "\n",
    "## How does cleanlab.multiannotator work?\n",
    "\n",
    "All estimates above are produced via the CROWDLAB algorithm, described in this paper that contains extensive benchmarks which show CROWDLAB can produce better estimates than popular methods like Dawid-Skene and GLAD:\n",
    "\n",
    "[CROWDLAB: Supervised learning to infer consensus labels and quality scores for data with multiple annotators](https://arxiv.org/abs/2210.06812)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b806d2ea",
   "metadata": {
    "nbsphinx": "hidden"
   },
   "outputs": [],
   "source": [
    "# Note: This cell is only for docs.cleanlab.ai, if running on local Jupyter or Colab, please ignore it.\n",
    "\n",
    "if majority_vote_accuracy >= cleanlab_label_accuracy:  # check cleanlab has improved prediction accuracy\n",
    "    raise Exception(\"Cleanlab training failed to improve consensus label accuracy\")\n",
    "\n",
    "if worst_quality_accuracy > better_quality_accuracy: # check bad consensus quality score corresponds to bad consensus\n",
    "    raise Exception(\"Cleanlab consensus quality score failed to detect bad consensus labels\")\n",
    "    \n",
    "annotator_stats = results[\"annotator_stats\"]\n",
    "bad_annotator_idx = [\"A0046\", \"A0047\", \"A0048\", \"A0049\", \"A0050\"]\n",
    "bad_annotator_mask = annotator_stats.index.isin(bad_annotator_idx)\n",
    "\n",
    "avg_annotator_quality_bad = np.mean(annotator_stats[bad_annotator_mask][\"annotator_quality\"])\n",
    "avg_annotator_quality_good = np.mean(annotator_stats[~bad_annotator_mask][\"annotator_quality\"])\n",
    "\n",
    "if avg_annotator_quality_bad >= avg_annotator_quality_good: # check bad annotator get bad quality scores \n",
    "    raise Exception(\"Low quality annotators have higher quality scores than good quality annotators\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  },
  "vscode": {
   "interpreter": {
    "hash": "50292dbb1f747f7151d445135d392af3138fb3c65386d17d9510cb605222b10b"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
